Deep learning‐based multi‐omics study reveals the polymolecular phenotypic of diabetic kidney disease


Dear Editor,
Approximately 30% to 40% of patients with type 2 diabetes mellitus (T2DM) develop diabetic kidney disease (DKD), and most will go on to develop end-stage renal disease. 1 The presence of kidney disease complicates the management of patients with T2DM. 2 Therefore, identifying biomarkers for the early diagnosis of DKD based on circulating molecular factors associated with physiological alterations in patients with T2DM can effectively reduce and delay the incidence of DKD. We used deep learning (DL) to analyze and process multi-omics data and establish key molecular characteristics (biomarker panels) that affect the incidence and development of DKD. Based on strict diagnostic inclusion and exclusion criteria, 405 subjects from two centers in China were included in the discovery (n = 105) and test (n = 300) sets and divided into healthy control (HC), T2DM, and DKD groups (Table 1 and Supplementary Materials).

TABLE 1 Characteristics of the participants included in the discovery set.
In the discovery set, the combination of lipidomics and data-independent acquisition quantitative proteomics enabled the discovery of additional potential biomarkers and pathological mechanisms related to the occurrence and development of DKD. Lipidomics revealed that the metabolic profiles of both disease groups changed significantly compared with that of the HC group; however, the metabolic profiles of the T2DM and DKD groups were relatively similar (Figure 1A). Using the criteria of variable importance in projection > 1 and p < .05, 70 differential serum metabolites (Table S2) were identified (Figure 1B and Figure S1A). These mainly involved metabolic pathways such as sphingolipid metabolism, steroid hormone biosynthesis, glycerophospholipid metabolism and arachidonic acid metabolism (Figure 1C). In addition, the distribution of lipid abundance and lipid classes among all groups showed that the proportions of glycerolipids and glycerophospholipids were the highest.
Proteomic data showed that protein content may vary depending on the physiological state of the individual (Figure 1D). With fold change (≥ 1.5 or ≤ .67) and p < .05 as screening criteria, 219 differential proteins were identified (Figure S1B and Figure 1E-F), most of which were highly expressed in both disease groups (Table S3). In addition, the Gene Ontology and Kyoto Encyclopedia of Genes and Genomes analyses of the 219 proteins showed that complement and coagulation cascades, focal adhesions and phagosomes were significantly enriched, suggesting that the development of DKD is related to pro-inflammatory signals (Figures S1C and S1D).
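The screening rule described above (fold change ≥ 1.5 or ≤ .67, together with p < .05) can be illustrated with a minimal sketch. The function name `screen_by_fold_change`, the protein labels and all numeric values are hypothetical and chosen only for illustration; they are not data or code from the study, and how the fold changes and p values were computed is not specified here.

```python
def screen_by_fold_change(fold_changes, p_values, up=1.5, down=0.67, alpha=0.05):
    """Return the names of features that pass both screening thresholds.

    A feature is kept when its fold change is >= `up` or <= `down`
    AND its p value is below `alpha`.
    """
    selected = []
    for name, fc in fold_changes.items():
        p = p_values[name]
        if (fc >= up or fc <= down) and p < alpha:
            selected.append(name)
    return selected


# Hypothetical values for illustration only (not data from the study).
fold_changes = {"protein_A": 2.10, "protein_B": 0.50,
                "protein_C": 1.20, "protein_D": 1.80}
p_values = {"protein_A": 0.003, "protein_B": 0.020,
            "protein_C": 0.001, "protein_D": 0.200}

differential = screen_by_fold_change(fold_changes, p_values)
print(differential)  # protein_C fails the fold-change rule, protein_D the p-value rule
```

In this toy input, only `protein_A` (up-regulated) and `protein_B` (down-regulated) satisfy both conditions.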
Research is increasingly focusing on applying multi-omics to identify 'at-risk' profiles. 3 At present, biomarkers for the risk of diabetes progressing to DKD have been identified at the single-molecule level; however, their diagnostic efficacy is poor. 4,5 DKD is a complex secondary disease, and studies on risk markers at multiple molecular levels would be helpful in reflecting disease risk. 6 We used support vector machine and convolutional neural network (CNN) models to evaluate the accuracy of single- or multi-omics data and found that the CNN model applied to multi-omics data showed significant advantages (Table S4), with the highest internal and prediction accuracies (100% and 90.48%, respectively). The neighborhood component analysis algorithm selected 58 fusion features (20%) from the 289 features (70 differential lipids and 219 differential proteins), including 32 differential proteins and 26 differential lipids.
To reveal the intrinsic association of the 58 fusion features with DKD, Pearson correlation coefficient analysis was performed (Figure 2A). Twelve lipid metabolites showed significant association (R > .5) with 26 differentially expressed proteins (Figure 2B). By plotting the relative abundance of these lipid metabolites, we observed that the vast majority of lipids were significantly more enriched in patients with T2DM than in those with DKD (Figure 2C) and showed a linear increase with disease progression. A strong positive correlation between trihydroxycoprostanoic acid, Cer (d18:1/16:0), and 3α,7α-dihydroxycoprostanic acid was observed (Figure 2D, R > .85, p < .01). These results suggest that DKD-related proteins are associated with changes in serum lipid metabolite levels.
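As a minimal sketch of the correlation step, the Pearson coefficient between one lipid and one protein can be computed as the covariance divided by the product of the standard deviations. The helper `pearson_r` and the abundance values below are hypothetical; the sketch only illustrates the R > .5 criterion and does not reproduce the study's analysis or its significance testing.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of products of deviations; the 1/(n-1) factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical relative abundances across five samples (illustration only).
lipid_abundance = [1.0, 2.0, 3.0, 4.0, 5.0]
protein_abundance = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pearson_r(lipid_abundance, protein_abundance)
is_associated = r > 0.5  # threshold quoted in the text
print(round(r, 3), is_associated)
```

Applied to every lipid-protein pair among the selected features, a rule of this kind would yield a correlation matrix from which associated pairs can be read off.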
In the test set, four lipid metabolites and four proteins among the 58 fusion features showed trends and content changes similar to those in the discovery set (Tables S5 and S6). Recently, several clinical histological studies have focused on the concept of a "biomarker panel". 2,7,8 Based on the above results, we selected 3α,7α-dihydroxycoprostanic acid and Cer (d18:1/16:0), which had high absolute contributions, to draw the receiver operating characteristic curve for distinguishing DKD from T2DM; the area under the curve (AUC) was .800 (95% confidence interval [CI]: .698-.902) (Figure 2E). Subsequently, the remaining six substances were added to obtain the best biomarker panel for predicting the development of DKD, which was composed of 3α,7α-dihydroxycoprostanic acid, Cer (d18:1/16:0), cyclase-associated protein 1 (CAP1) and talin-1 (TLN1) (AUC = .873; 95% CI: .794-.951) (Figures 2F and S2A-B).
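For readers less familiar with the AUC, the following minimal sketch computes it as the probability that a randomly chosen DKD sample receives a higher score than a randomly chosen T2DM sample (ties count as one half). The function `roc_auc` and the scores are hypothetical; how the panel markers were combined into a single score is not described here, so the sketch assumes one combined score per sample and does not compute a confidence interval.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Equivalent to the Mann-Whitney U statistic divided by the number of pairs;
    tied pairs contribute 0.5.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical combined panel scores (illustration only, not study data).
dkd_scores = [0.9, 0.8, 0.6, 0.4]   # positive class: DKD
t2dm_scores = [0.7, 0.5, 0.3, 0.2]  # negative class: T2DM

auc = roc_auc(dkd_scores, t2dm_scores)
print(auc)  # 13 of 16 pairs are ranked correctly -> 0.8125
```

An AUC of .5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported values of .800 and .873 should be read.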
In conclusion, this study combined multiple bioinformatic tools and learning algorithms to comprehensively identify an optimal diagnostic biomarker panel for the disease. Our findings provide insights for the integrated modelling of multi-omics data and new research opportunities for T2DM complications. Furthermore, the combined use of two powerful omics techniques, lipidomics and proteomics, provided a comprehensive understanding of this disease. 9,10 The advent of DL will enable the handling of large amounts of high-dimensional and complex-structured data, further enabling the identification of key metabolic features.

LIMITATIONS
Because of practical constraints such as sample collection and time, this study trained models on a small population and validated them in a larger cohort, which may have resulted in some features being neglected. Therefore, in future studies, attention should be paid to the cohort settings (usually 8:1 to 4:1).
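To make the ratio concrete, the sketch below computes cohort sizes for a given split, assuming (this reading is an assumption, not stated in the text) that 8:1 and 4:1 denote the ratio of the training cohort to the test cohort. The helper `split_sizes` is hypothetical. For comparison, the present study used 105 discovery and 300 test subjects out of 405.

```python
def split_sizes(n_total, train_parts, test_parts=1):
    """Return (n_train, n_test) for a train:test ratio of train_parts:test_parts.

    Assumption: the ratio refers to training versus test cohort size.
    The test size is rounded down; the remainder goes to training.
    """
    if n_total <= 0 or train_parts <= 0 or test_parts <= 0:
        raise ValueError("all arguments must be positive")
    n_test = n_total * test_parts // (train_parts + test_parts)
    return n_total - n_test, n_test


n_subjects = 405  # total number of subjects reported in the letter
sizes_8_to_1 = split_sizes(n_subjects, 8)
sizes_4_to_1 = split_sizes(n_subjects, 4)
print(sizes_8_to_1, sizes_4_to_1)
```

Under this assumption, 405 subjects would give 360 training and 45 test subjects at 8:1, or 324 and 81 at 4:1.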